Abstract

Large numbers of people all over the world read and con- 
tribute to various review sites. Many contributors are under- 
standably concerned about privacy in general and, specifi- 
cally, about linkability of their reviews (and accounts) across 
multiple review sites. In this paper, we study linkability of 
community-based reviewing and try to answer the question: 
to what extent are "anonymous" reviews linkable, i.e., highly 
likely authored by the same contributor? Based on a very 
large set of reviews from one very popular site (Yelp), we 
show that a high percentage of ostensibly anonymous reviews can be
linked with very high confidence. This is despite the fact that we use
very simple models and equally simple feature sets. Our study suggests
that contributors reliably expose their identities in reviews. This has important
implications for cross-referencing accounts between differ- 
ent review sites. Also, techniques used in our study could be 
adopted by review sites to give contributors feedback about 
privacy of their reviews. 

1. Introduction 

In recent years, popularity of various types of review and 
community-knowledge sites has substantially increased. 
Prominent examples include Yelp, Tripadvisor, Epinions, 
Wikipedia, Expedia and Netflix. They attract multitudes of 
readers and contributors. While the former usually greatly 
outnumber the latter, contributors can still number in hun- 
dreds of thousands for large sites, such as Yelp or Wikipedia. 
For example, Yelp had more than 39 million visitors and
reached 15 million reviews in late 2010 [2]. To motivate
contributors to provide more (and more useful/informative)
reviews, certain sites even offer rewards.

Some review sites are generic (e.g., Epinions) while others are
domain-oriented (e.g., Tripadvisor). Large-scale reviewing is not
limited to review-oriented sites; in fact, many retail sites encourage
customers to review their products, e.g., Amazon and Netflix.



With the surge in popularity of community- and peer- 
based reviewing, more and more people contribute to re- 
view sites. At the same time, there has been an increased 
awareness with regard to personal privacy. Internet and Web 
privacy is a broad notion with numerous aspects, many of 
which have been explored by the research community. However, privacy
in the context of review sites has not been adequately studied.
Although there has been a lot of recent research related to reviewing,
its focus has been mainly on extracting and summarizing opinions from
reviews, as well as determining authenticity of reviews [11].



In the context of community-based reviewing, contributor
privacy has several aspects: (1) some review sites do not re- 
quire accounts (i.e., allow ad hoc reviews) and contributors 
might be concerned about linkability of their reviews, and 
(2) many active contributors have accounts on multiple re- 
view sites and prefer these accounts not be linkable. The flip 
side of the privacy problem is faced by review sites them- 
selves: how to address spam-reviews and sybil-accounts? 

The goal of this paper is to explore linkability of reviews by
investigating how close and related a person's reviews are, that is,
how accurately we can link a set of anonymous reviews to their original
author. Our study is based on over 1,000,000 reviews and approximately
2,000 contributors from Yelp. The results clearly illustrate that most
(up to 99% in some cases) reviews by relatively active/frequent
contributors are highly linkable. This is despite the fact that our
approach is based on simple models and simple feature sets.
For example, using only alphabetical letter distributions, we 
can link up to 83% of anonymous reviews. We anticipate 
two contributions of this work: (1) extensive assessment of 
reviews' linkability, and (2) several models that quite accu- 
rately link "anonymous" reviews. 

Our results have several implications. One of them is the 
ability to cross-reference contributor accounts between mul- 
tiple review sites. If a person regularly contributes to two 
review sites under different accounts, anyone can easily link 
them, since most people tend to consistently maintain their 
traits in writing reviews. This is possibly quite detrimental 
to personal privacy. Another implication is the ability to correlate
reviews ostensibly emanating from different accounts that are produced
by the same author. Our approach can thus be very useful in detecting
self-reviewing and, more generally, review spam [11] whereby one person
contributes from multiple accounts to artificially promote or criticize
products or services.

One envisaged application of our technique is to have it 
integrated into review site software. This way, review au- 
thors could obtain feedback indicating the degree of linka- 
bility of their reviews. It would then be up to each author to 
adjust (or not) the writing style and other characteristics. 
Organization: Section 2 provides background information about
techniques used in our experiments. The sample dataset is described in
Section 3 and study settings are addressed in Section 4. Next, our
analysis methodology is presented in Section 5, followed by a
discussion of issues stemming from this work. Section 6 sketches out
some directions for the future. Then, Section 7 overviews related work
and Section 8 concludes the paper.

2. Background 

This section provides some background about statistical 
tools used in our study. We use two well-known approaches 
based on: (1) Naive Bayes Model [17], (2) Kullback-Leibler
Divergence Metric [5]. We briefly describe them below.

2.1 Naive Bayes Model 

Naive Bayes Model (NB) is a probabilistic model based on 
the eponymous assumption stating that all features/tokens 
are conditionally independent given the class. Given tokens
$T_1, T_2, \ldots, T_n$ in document $D$, we compute the conditional
probability of a document class $C$ as follows:

$$P(C \mid D) = P(C \mid T_1, T_2, \ldots, T_n) = \frac{P(T_1, T_2, \ldots, T_n \mid C)\, P(C)}{P(T_1, T_2, \ldots, T_n)}$$

According to the Naive Bayes assumption, 

$$P(T_1, T_2, \ldots, T_n \mid C) = P(T_1 \mid C)\, P(T_2 \mid C) \cdots P(T_n \mid C)$$

Therefore, 

$$P(C \mid T_1, T_2, \ldots, T_n) = \frac{P(T_1 \mid C)\, P(T_2 \mid C) \cdots P(T_n \mid C)\, P(C)}{P(T_1, T_2, \ldots, T_n)}$$



To use NB for classification, we return the class value with 
maximum probability: 

$$Class = \arg\max_C P(C \mid D) = \arg\max_C P(C \mid T_1, T_2, \ldots, T_n) \qquad (1)$$

Since $P(T_1, T_2, \ldots, T_n)$ is the same for all $C$ values, and
assuming $P(C)$ is the same for all class values, the above
equation is reduced to:

$$Class = \arg\max_C P(T_1 \mid C)\, P(T_2 \mid C) \cdots P(T_n \mid C)$$

Probabilities are estimated using the Maximum-Likelihood
estimator [5] as follows:

$$P(T_i \mid C) = \frac{\text{Num of Occurrences of } T_i \text{ in } D}{\text{Num of Occurrences of all Tokens in } D}$$



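We smooth the probabilities with Laplace smoothing [16] as
follows:

$$P(T_i \mid C) = \frac{\text{Num of Occurrences of } T_i \text{ in } D + 1}{\text{Num of Occurrences of all Tokens in } D + \text{Num of Possible Tokens}}$$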
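
The smoothed estimate can be illustrated with a minimal Python sketch. The function name `smoothed_probabilities` and the toy three-letter vocabulary are assumptions made for illustration only; the sketch also assumes that every observed token belongs to the vocabulary.

```python
from collections import Counter

def smoothed_probabilities(tokens, vocabulary):
    """Laplace-smoothed estimate of P(T_i | C) from the tokens of one class.

    Hypothetical helper: assumes every token in `tokens` is in `vocabulary`.
    """
    counts = Counter(tokens)
    # Denominator: all observed tokens plus the number of possible tokens.
    denominator = len(tokens) + len(vocabulary)
    return {t: (counts[t] + 1) / denominator for t in vocabulary}

# Toy example: five observed tokens over a three-letter vocabulary.
probs = smoothed_probabilities(list("aabac"), vocabulary="abc")
```

Because one pseudo-count is added per possible token, no token receives probability zero and the estimates still sum to one.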
Figure 1. CDF for: (a) number of reviews per contributor,
and (b) average review size (number of words) per contributor.



2.2 Kullback-Leibler Divergence Metric 

Kullback-Leibler Divergence (KLD) metric measures the 
distance between two distributions. For any two distributions 
P and Q, it is defined as: 



$$D_{kl}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$



KLD is always non-negative: the closer to zero, the closer $Q$
is to $P$. It is an asymmetrical metric, i.e., in general
$D_{kl}(P \| Q) \neq D_{kl}(Q \| P)$. To transform it into a symmetrical
metric, we use the following formula (that has been used in [23]):

$$SymD_{kl}(P, Q) = 0.5 \times (D_{kl}(P \| Q) + D_{kl}(Q \| P)) \qquad (2)$$

Basically, $SymD_{kl}$ is a symmetrical version of $D_{kl}$ that
measures the distance between two distributions. As discussed
below, it is used heavily in our study. In the rest of
the paper, the term "KLD" stands for $SymD_{kl}$.
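
As a rough illustration of Equation (2), the following Python sketch computes both directions of the divergence and averages them. The helper names, the use of the natural logarithm, and the representation of distributions as dictionaries over the same keys are all assumptions; the sketch further assumes that no entry of `q` is zero wherever `p` is positive (e.g., because the distributions were smoothed).

```python
import math

def kld(p, q):
    """D_kl(P || Q) for two distributions given as dicts over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def sym_kld(p, q):
    """Symmetric KLD as in Equation (2): the average of both directions."""
    return 0.5 * (kld(p, q) + kld(q, p))

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
```

Here `kld(p, q)` and `kld(q, p)` differ, whereas `sym_kld(p, q)` and `sym_kld(q, p)` agree, which is the property the symmetric version is meant to provide.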

3. Data Set 

Clearly, a very large set of reviews authored by a large number of
contributors is necessary in order to perform a meaningful study. To
this end, we collected 1,076,850 reviews for 1,997 contributors from
yelp.com, a very popular site with many prolific contributors. As
shown in Figure 1(a), the minimum number of reviews per contributor is
330, the maximum is 3,387 and the average is 539, with a standard
deviation of 354. For the purpose of this study, we limited authorship
to prolific contributors, since this provides more useful information
for the purpose of review linkage.

Figure 1(a) shows the Cumulative Distribution Function
(CDF) of the number of reviews per contributor. 50% of
the contributors authored fewer than 500 reviews and 76%
authored fewer than 600. Only 6% of the contributors exceed
1,000 reviews.

Figure 1(b) shows the CDF for average review size (number of words)
per contributor. It shows that 50% of the contributors write reviews
shorter than 140 words (on average)






¹ Note that, under certain conditions, NB and asymmetrical KLD models
could be equivalent. That is,
$\arg\max_{class} P(class \mid T_1, T_2, \ldots, T_n)$ is equivalent to
$\arg\min_{class} D_{kl}(Token\_distribution \| Class\_distribution)$, where
$T_1, T_2, \ldots, T_n$ are the tokens of a document $D$ and
$Token\_distribution$ is their derived distribution. The proof for this
equivalency is in [23]. However, this equivalence does not hold when we
use the symmetrical version $SymD_{kl}$.
and 75% have an average review size smaller than 185 words. Also,
97% of contributors write reviews shorter than 300 words.
The overall average review size is relatively small: 149 words.



4. Study Setting 

As mentioned earlier, our central goal is to study linkability 
of relatively prolific reviewers. Specifically, we want to un- 
derstand - for a given prolific author - to what extent some 
of his/her reviews relate to, or resemble, others. To achieve 
that, we first randomly order the reviews of each contributor. 
Then, for each contributor $U$ with $N_U$ reviews, we split the
randomly ordered reviews into two sets:

1. First $N_U - X$ reviews: We refer to this as the identified
record (IR) of $U$.

2. Last $X$ reviews: These reviews represent the full set of
anonymous reviews of $U$, from which we derive several subsets of
various sizes. We refer to each of these subsets as an anonymous
record (AR) of $U$. An AR of size $i$ consists of the first $i$ reviews
of the full set of anonymous reviews of $U$. We vary the AR size in
order to study the linkability of user reviews under different
numbers of anonymous reviews.

Since we want to restrict the AR size to a small portion of 
the complete user reviews set, we restrict X to 60 as this rep- 
resents less than 20% of the minimum number of reviews for 
authors in our set (330 total). We use the identified records 
(IRs) of all contributors as the training set upon which we 
build models for linking anonymous reviews. (Note that the 
IR size is not the same for all contributors, while the AR 
size is uniform.) Thus, our problem is reduced to matching 
an anonymous record to its corresponding IR. Specifically, 
one anonymous record serves as input to a matching/linking 
model and the output is a sorted list of all possible account-ids
(i.e., IR sets) listed in descending order of probability, i.e., the
top-ranked account-id corresponds to the contributor whose IR
represents the most probable match for the input anonymous record.
Then, if the correct account-id of the actual author is among the top
T entries, the matching/linking model has a hit; otherwise, it is a
miss. Consequently, our study boils down to exploring matching/linking
models that maximize the hit ratio of the anonymous records for varying
values of both T and AR size. We consider three values of T: 1
(perfect hit), 10 (near-hit) and 50 (near-miss). For the AR size, we
experiment with a wider range of values: 1, 5, 10, 20, 30, 40, 50 and 60.
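
The setting above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the fixed random seed, and the `rankings` dictionary (mapping each true account-id to the ranked list produced for its AR) are hypothetical, and `x` is assumed to be positive.

```python
import random

def split_records(reviews, x=60, seed=0):
    """Shuffle one contributor's reviews; return (IR, full anonymous set)."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    # First N_U - X reviews form the IR, the last X the anonymous set.
    return shuffled[:-x], shuffled[-x:]

def anonymous_record(anonymous_set, size):
    """An AR of a given size is the first `size` anonymous reviews."""
    return anonymous_set[:size]

def is_hit(ranked_account_ids, true_account_id, top_t):
    """True if the actual author is among the top T ranked account-ids."""
    return true_account_id in ranked_account_ids[:top_t]

def linkability_ratio(rankings, top_t):
    """Fraction of ARs whose true author appears among the top T entries."""
    hits = sum(is_hit(r, true_id, top_t) for true_id, r in rankings.items())
    return hits / len(rankings)

# Toy example with two accounts and their ranked lists.
ir, anon = split_records(range(330))
rankings = {"u1": ["u1", "u2", "u3"], "u2": ["u3", "u1", "u2"]}
```

With the toy rankings, only `u1` is a Top-1 hit, while both accounts are hits once T covers the whole list.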

Even though our focus is on the linkability of prolific
users, we also attempt to assess performance of our models
for non-prolific users. For that, we slightly change the problem
setting by making the IR size smaller; this is discussed
in Section 5.3.4.



NB                          Naive Bayes Model
KLD                         Symmetrical Kullback-Leibler Divergence Model
R                           Token type: rating, unigram or digram
LR                          Linkability Ratio
AR                          Anonymous Record
IR                          Identified Record (corresponding to a certain reviewer)
$SymD_{KLD}(IR, AR)$        symmetric KLD distance between IR and AR
$SymD_{KLD\_r}$             symmetric KLD of rating tokens
$SymD_{KLD\_c}$             symmetric KLD of category tokens
$SymD_{KLD\_l}$             symmetric KLD of lexical (unigram or digram) tokens
$SymD_{KLD\_r\_c}$          symmetric KLD of rating and category tokens
$SymD_{KLD\_l\_r\_c}$       symmetric KLD of lexical, rating and category tokens

Table 1. Notation and abbreviations.



5. Analysis 

As mentioned in Section 2, we use Naive Bayes (NB) and
Kullback-Leibler Divergence (KLD) models. Before analyzing
the collected data, we tokenize all reviews and extract
four types of tokens:

1. Unigrams: set of all single letters. We discard all non- 
alphabetical characters. 

2. Digrams: set of all consecutive letter-pairs. We discard 
all non-alphabetical characters. 

3. Rating: rating associated with the review. (In Yelp, this 
ranges between 1 and 5). 

4. Category: category associated with the place/service being
reviewed. There are 28 categories in our dataset.
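
The two lexical token types in the list above can be extracted with a few lines of Python. This sketch makes assumptions the text does not state: letters are case-folded, only ASCII letters count as alphabetical, and digrams are formed after non-alphabetical characters have been discarded (so a pair may span a word boundary). The function names are hypothetical.

```python
from collections import Counter

def letters_only(text):
    """Lower-case the text and drop every non-alphabetical character."""
    return [ch for ch in text.lower() if "a" <= ch <= "z"]

def unigrams(text):
    """Counts of single letters."""
    return Counter(letters_only(text))

def digrams(text):
    """Counts of consecutive letter pairs over the filtered letter sequence."""
    letters = letters_only(text)
    return Counter(a + b for a, b in zip(letters, letters[1:]))

uni = unigrams("Good food!")
di = digrams("Good food!")
```

Rating and category tokens need no extraction step beyond reading the corresponding review fields.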

Why such simple tokens? Our choice of these four primitive
token types might seem trivial or even naive. In fact, initial
goals of this study included more "sophisticated" types
of tokens, such as: (1) distribution of word usage, (2) sen- 
tence length in words, and (3) punctuation usage. We orig- 
inally planned to use unigrams and digrams as a baseline, 
imagining that (as long as all reviews are written in the same 
language - English, in our case) single and double-letter dis- 
tributions would remain more-or-less constant across con- 
tributors. However, as our results clearly indicate, our hy- 
pothesis was wrong. 

In the rest of this section, we analyze results produced 
by NB and KLD models. Before proceeding, we recap
abbreviations and notation in Table 1.

5.1 Methodology 

We begin with the brief description of the methodology for 
the two models. 

5.1.1 Naive Bayes (NB) Model 

For each account IR, we build an NB model, $P(token_i \mid IR)$,
from its identified record. Probabilities are computed using
the Maximum-Likelihood estimator [5] and Laplace
smoothing [16] as shown in Section 2. We then construct four models
corresponding to the four aforementioned token types.

Figure 2. LRs for NB and KLD models for rating and category
tokens. (Recoverable panel titles: NB Model-Rating, KLD Model-Rating.)

Figure 3. LRs of NB and KLD models for unigrams and
digrams. (Recoverable panel titles: NB Model-Unigram, KLD Model-Unigram.)



That is, for each IR, we have $P_{unigram}$, $P_{digram}$,
$P_{category}$ and $P_{rating}$.



To link an anonymous record AR to an account IR with
respect to token type $R$, we first extract all $R$-type tokens
from AR, $T_{R_1}, T_{R_2}, \ldots, T_{R_n}$ (where $T_{R_i}$ is the
$i$-th $R$ token in AR). Then, for each IR, we compute the probability
$P_R(IR \mid T_{R_1}, T_{R_2}, \ldots, T_{R_n})$. Finally, we return a
list of accounts sorted in decreasing order of probabilities. The top
entry represents the most probable match.
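
A minimal Python sketch of this NB linking step follows. Summing log-probabilities instead of multiplying probabilities is an implementation choice assumed here to avoid numerical underflow; it does not change the ranking. The names `train_nb` and `rank_accounts_nb`, the toy accounts, and the three-letter vocabulary are hypothetical.

```python
import math
from collections import Counter

def train_nb(ir_tokens, vocabulary):
    """Laplace-smoothed log P(token | IR) for one identified record."""
    counts = Counter(ir_tokens)
    total = len(ir_tokens) + len(vocabulary)
    return {t: math.log((counts[t] + 1) / total) for t in vocabulary}

def rank_accounts_nb(ar_tokens, models):
    """Rank account-ids by the log-likelihood of the AR tokens, best first.

    With a uniform prior over accounts, this ordering matches the
    ordering by posterior probability.
    """
    scores = {
        account: sum(log_p[t] for t in ar_tokens if t in log_p)
        for account, log_p in models.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

vocabulary = "abc"
models = {
    "alice": train_nb(list("aaaabc"), vocabulary),
    "bob": train_nb(list("abcccc"), vocabulary),
}
ranking = rank_accounts_nb(list("aab"), models)
```

In the toy example, the anonymous tokens are dominated by the letter `a`, so the account whose IR is rich in `a` is ranked first.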

5.1.2 Kullback-Leibler Divergence (KLD) Model 

We use symmetric KLD (see Section 2.2) to compute the
distance between anonymous and identified records. To do
so, we first compute distributions of all records, as follows:

$$Dist_{token}(Token_i) = \frac{\text{Num of Occurrences of } Token_i}{\text{Num of Occurrences of all Tokens}}$$

To avoid division by 0, we smooth distributions via Laplace
smoothing [16], as follows:

$$Dist_{token}(Token_i) = \frac{\text{Num of Occurrences of } Token_i + 1}{\text{Num of Occurrences of all Tokens} + \text{Num of Possible Tokens}}$$

As before, we compute four distributions. To link AR with
respect to token type $R$, we compute $SymD_{kl}$ between the
distribution of $R$ for AR and the distribution of $R$ for each
IR. Then, we return a list sorted in ascending order of
$SymD_{KLD}(IR, AR)$ values. The first entry represents the
account with the most likely match.
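
The KLD linking step can be sketched in Python as below. The function names and toy data are hypothetical, and the natural logarithm is assumed. The symmetric distance is written in the compact form $0.5 \sum_t (p_t - q_t) \log(p_t / q_t)$, which is algebraically equal to averaging the two directed divergences.

```python
import math
from collections import Counter

def smoothed_distribution(tokens, vocabulary):
    """Laplace-smoothed token distribution; no entry is ever zero."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {t: (counts[t] + 1) / total for t in vocabulary}

def sym_kld(p, q):
    """0.5 * (D(P||Q) + D(Q||P)) over a shared vocabulary."""
    return 0.5 * sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)

def rank_accounts_kld(ar_tokens, ir_tokens_by_account, vocabulary):
    """Rank account-ids by ascending symmetric KLD between AR and each IR."""
    ar_dist = smoothed_distribution(ar_tokens, vocabulary)
    distances = {
        account: sym_kld(smoothed_distribution(tokens, vocabulary), ar_dist)
        for account, tokens in ir_tokens_by_account.items()
    }
    return sorted(distances, key=distances.get)

irs = {"alice": list("aaaabc"), "bob": list("abcccc")}
ranking = rank_accounts_kld(list("aab"), irs, vocabulary="abc")
```

Unlike the NB sketch, the AR is itself turned into a distribution before comparison, which is why very small ARs give noisier distances.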

5.2 Study Results 

We now present the results corresponding to our four token types.
Then, in the next section, we experiment with some combi- 
nations thereof. 

5.2.1 Non-Lexical: Rating and Category 

Figure 2 shows Top-1, Top-10, and Top-50 plots of the linkability
ratios (LRs) for NB and KLD models for several anonymous record sizes
when either rating or category is used as the token. Not surprisingly,
an increase in the anonymous record size causes an increase in the LR.
Figures 2(a) and 2(b) show LRs when the rating token alone is used. In
the Top-1 plot, LRs are low and the highest ratio is 2.5%/1.9% in
NB/KLD for an anonymous record size of 60. However, in Top-10 and
Top-50 plots, LRs become higher and reach 13.9% (11.9%) and 34.8%
(33.3%), respectively, in NB (KLD) for the same anonymous record size.
Figures 2(c) and 2(d) show LRs for the category token. In Top-1, the
highest LR is 13.6%/4.7% in NB/KLD for an anonymous record size of 60.
A significant increase occurs in LRs in Top-10 and Top-50 plots: 40.4%
(17.4%) and 68.2% (38.3%), respectively, in the NB (KLD) model. The
category token is clearly more effective than the rating token.
Additionally, we observe that NB performs better than KLD, especially
for the category token.

We conclude that rating- and category-based models are 
only somewhat helpful, yet insufficient to link accounts for 
many anonymous records. However, it turns out that they 
are quite useful when combined with lexical tokens, as
discussed in Section 5.3 below.

5.2.2 Lexical: Unigram and Digram 

Figures 3(a) and 3(b) depict LRs (Top-1 and Top-10) for
NB and KLD with the unigram token. As expected, with
the increase in the anonymous record size, the LR grows:
it is high in both Top-1 and Top-10 plots. For example,
in Top-1 of both figures, the LRs are around 19%, 59%
and 83% for anonymous record sizes of 10, 30 and 60,
respectively. In Top-10 of both figures, the LRs are
around 45.5%, 83% and 96% for the same record sizes. This
suggests that reviews are highly linkable based on trivial
single-letter distributions. Note that the two models exhibit
similar performance.

Figures 3(c) and 3(d) consider the digram token. In
both models, the LR is impressively high: it gets as high
as 99.6%/99.2% in Top-1 for NB/KLD for an AR size of
60. For example, the Top-1 LRs in NB are 11.7%, 62.9%,
87.5% and 97.1%, for respective AR sizes of 1, 5, 10 and
20. In KLD, the Top-1 LRs for record sizes of 10,
30 and 60 are 1.9%, 74.9% and 99.2%, respectively.

Unlike unigrams, where LRs in both models are comparable, KLD with
digrams starts with LRs considerably lower than those of NB. However,
the situation changes when the record size reaches 50, with KLD
performing comparably to NB. One reason for that could be that KLD
improves when
the distribution of ARs is more similar to that of correspond- 
ing identified records; this usually occurs for large record 
sizes, as there are more tokens. 

Not surprisingly, in both lexical and non-lexical models,
larger AR sizes entail higher LRs. With NB, a larger record
size implies that a given AR has more tokens in common
with the corresponding IR, and thus an increase in the prediction
probability $P(IR \mid T_1, T_2, \ldots, T_n)$. For KLD, a larger
record size causes the distribution derived from the AR to be
more similar to the one derived from the corresponding IR.

5.3 Improvement I: Combining Lexical with 
non-lexical Tokens 

In an attempt to improve the LR, we now combine the non- 
lexical token with its lexical counterparts. 

5.3.1 Combining Tokens Methodology 

This is straightforward in NB. We simply extend
the list of tokens in the unigram- or digram-based NB
by adding the non-lexical tokens. Thus, for every AR, we
have $P(lexical\_token_i \mid IR)$, $P(category\_token_i \mid IR)$ and
$P(rate\_token_i \mid IR)$.

Combining non-lexical with lexical tokens in KLD is
less clear. One way is to simply average $SymD_{KLD}$ values
for both token types. However, this might degrade the
performance, since lexical distributions convey much more
information than their non-lexical counterparts. Thus, giving
them the same weight would not yield better results.
Instead, we combine them using a weighted average. First,
we compute the weighted average of rating and category
$SymD_{KLD}$:

$$SymD_{KLD\_r\_c}(P, Q) = \beta \times SymD_{KLD\_r}(P, Q) + (1 - \beta) \times SymD_{KLD\_c}(P, Q)$$

Then, we combine the above with $SymD_{KLD}$ of the lexical
tokens to compute the final weighted average:

$$SymD_{KLD\_l\_r\_c}(P, Q) = \alpha \times SymD_{KLD\_l}(P, Q) + (1 - \alpha) \times SymD_{KLD\_r\_c}(P, Q)$$

Thus, our goal is to get the right $\beta$ and $\alpha$ values.
Intuitively, $SymD_{KLD\_l}$ should have more weight as it carries more
information. Since there is no clear way of assigning weight
values, we experiment with several choices and pick the one
with the best performance; we discuss the selection process
below. We experiment only within the IR set and then verify that
the results generalize to the AR. This is done as follows:



Figure 4. Results of combining different tokens using different
$\beta$ and $\alpha$ values: (a) combining rating and category (beta
values); (b) combining rating, category and unigram (alpha values);
(c) combining rating, category and digram (alpha values).

First, for every IR, we allocate the last 30 reviews as a
testing record and the remainder as a training record. Then,
we experiment with $SymD_{KLD\_r\_c}$ using several $\beta$ values
and set $\beta$ to the value that yields the highest LR based on the
tested records. Then, we experiment with $SymD_{KLD\_l\_r\_c}$
using several $\alpha$ values and, similarly, pick the one with the
highest LR.

Since $\beta$ or $\alpha$ could assume any values, we need to restrict
their choices. For $\beta$, we postulate that its optimal value is
close to 0.5 since LRs for rating and category are comparable.
Thus, we experiment with a range of values, from 0 to
1.00 in 0.1 increments. For $\alpha$, we expect the optimal value
to exceed 0.9, since the LR for lexical tokens is significantly
higher than for non-lexical ones. Therefore, we experiment
with the weighted average by varying $\alpha$ between 0.9 and
0.99 in 0.01 increments.

If the values exhibit an increasing trend (i.e., the LR of
$SymD_{KLD\_l\_r\_c}$ at $\alpha = 0.99$ is the largest in this range),
we continue experimenting in the 0.99-1.00 range in 0.001 increments.
Otherwise, we stop. For further verification, we also experiment
with smaller $\alpha$ values: 0.0, 0.3, 0.5, 0.7, and 0.8, all of
which yield LRs significantly lower than those at 0.9 for both the
unigram and digram. We acknowledge that we may be missing
$\alpha$ or $\beta$ values that could further optimize
$SymD_{KLD\_l\_r\_c}$. However, results in Sections 5.3.2 and 5.3.3
show that our selection yields good results.

Figure 5. LRs of NB and KLD for combining ratings and
categories.

Figure 4(a) shows LRs (Top-1) for $\beta$ values. The LR
gradually increases until it tops off at 3.4% with $\beta = 0.5$ and
then it gradually decreases. Figure 4(b) shows LRs (Top-1)
for $\alpha$ values in the unigram case. The LR has an increasing
trend until it reaches 67.8% with $\alpha = 0.997$ and then
it decreases. Figure 4(c) shows LRs (Top-1) for $\alpha$ values in
the digram case, where it tops off at 75.9% with $\alpha = 0.97$.
Thus, the final values are 0.5 for $\beta$ and 0.997/0.97 for $\alpha$
in the unigram/digram case. Even though we extract $\alpha$ and
$\beta$ values by testing on a record size of 30, the results in the
following sections show that the derived weights are effective
when tested on ARs of other sizes.
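
The two-stage weighted average and the grid-based weight selection described in this section can be sketched in Python as follows. The function names are hypothetical, and `top1_lr` stands for a caller-supplied evaluation routine (weight in, Top-1 LR on the held-out testing records out) that is not shown here.

```python
def combined_distance(d_lexical, d_rating, d_category, alpha, beta):
    """Two-stage weighted average of per-token-type symmetric KLD values."""
    d_non_lexical = beta * d_rating + (1 - beta) * d_category
    return alpha * d_lexical + (1 - alpha) * d_non_lexical

def best_weight(candidates, top1_lr):
    """Return the candidate weight with the highest Top-1 LR.

    `top1_lr` is an assumed callable: weight -> LR on held-out records.
    """
    return max(candidates, key=top1_lr)

# Candidate grids matching the ranges explored in the text.
beta_grid = [round(0.1 * i, 1) for i in range(11)]           # 0.0 .. 1.0
alpha_grid = [round(0.90 + 0.01 * i, 2) for i in range(10)]  # 0.90 .. 0.99
```

Selecting $\beta$ first and $\alpha$ second keeps the search to two one-dimensional sweeps instead of a full two-dimensional grid, at the cost of possibly missing a better joint setting.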

5.3.2 Combining Rating and Category - Results 

Figures 5(a) and 5(b) show Top-1 plots for NB and KLD
models when rating and category tokens are combined or
used separately. Clearly, combining the tokens significantly
improves LRs for several record sizes. In NB, the gain in
Top-1 LRs ranges from 2.5-15.9%/3.5-27.1% over the
category/rating based model for most record sizes. For example,
LRs increase from 5.8(1.4)% and 13.6(2.5)% in the category (rating)
based model to 14.5% and 29.5% in the NB combined model for record
sizes of 30 and 60, respectively.
In Top-50, the LR could reach as high as 87.7%, versus
68.2(34.8)% in the category (rating) based model for a
record size of 60.

In KLD, the gain in Top-1 LRs ranges from 1.7-9.9%/1.6- 
12.7% over category/rating based model for most record 
sizes. For example, it leaps from 1.1(1.3)% and 4.7(1.9)% 
in category (rating) based model to 3.7% and 14.6% in KLD 
combined model for record sizes of 30 and 60, respectively. 
The gain is even higher in Top-50 where it reaches 69.1%, 
versus 38.3(33.3)% in the category (rating) based model for 
a record size of 60. These results show that combining rating 
and category tokens is very effective in increasing LRs in 
both NB and KLD models. 

5.3.3 Combining Lexical with Non-Lexical Tokens 

Figures 6(a) and 6(b) show Top-1 and Top-10 plots in NB
and KLD models of unigram tokens before and after combining
them with rating and category tokens. Adding non-lexical
tokens to unigrams substantially increases LRs for
several record sizes. In NB, the gain in Top-1 LRs ranges
from 0.25-18.9% (1.4-15.7% for Top-10 LRs). In KLD, the
gain in Top-1 LRs ranges from 2.5-11.9% (2-7.8% in Top-10
LRs) for most record sizes. These findings show how effective
it is to combine the non-lexical tokens with the unigrams.
In fact, we can accurately identify almost all ARs.

Figures 6(c) and 6(d) show the effect of adding ratings
and categories to digrams. The overall effect is minuscule:
in the NB (KLD) model, the increase in Top-1 LRs ranges from
0.3-1.8% (0.2-2.7%) for most record sizes. The increase is
very similar in Top-10 plots.

Figure 6. LRs for NB and KLD for combining ratings and
categories with unigrams or digrams.

Figure 7. LRs for NB and KLD in full and restricted
identified set.

5.3.4 Restricting Identified Record Size 

In previous sections, our analysis was based on using the full 
data set. That is, except for the anonymous part of the data 
set, we use all of the user reviews as part of our identified set. 
Although the LR is high in many cases, it is not clear how 
the models will perform when we restrict the IR size. To
this end, we re-evaluate the models with the same problem
settings but with a restricted IR size. We restrict the
IR size to the AR size; both are randomly selected without
replacement.

Figures 7(a) and 7(b) show two Top-1 plots in NB and
KLD models: one plot corresponds to the restricted identified
set and the other to the full set. Tokens used in the
models consist of digrams, ratings and categories (since this
combination gives the highest LR). Unlike the previous sections,
where NB and KLD behaved similarly, the two models
now behave differently when restricting the identified set.
While NB performs better than KLD on the full set, the latter
performs much better than NB when the identified set is
restricted. In fact, in some cases, KLD performs better when
the set is restricted.

The reason for this improved KLD performance might be 
the following: in the symmetric KLD distance function, the 
distributions of both the IR and AR have to be very close in 
order to match regardless of the size of the IR; unlike the NB, 
where larger training sets would lead to better estimates of 
the token probabilities and thus more accurate predictions. 

In KLD, we achieve high LRs for many record sizes. For
example, Top-1 LRs in the restricted set are 74.5%, 88% and
97.1% when the anonymous (and identified) record sizes are
30, 40 and 60, respectively, whereas the LRs in the full set
for the same AR sizes are 76.5%, 93% and 99.4%. When
the record size is less than 30, KLD performs better in the
restricted set than in the full one. For example, when the AR
size is 20, the LR is 50.1% in the restricted set and 34.3%
in the full set. In NB, the Top-1 LR in the restricted set is lower
than in the full set. For instance, it is 20.8%, 35.3% and 62.4%
for AR sizes of 30, 40 and 60, respectively, whereas, for the
same sizes, the LR is more than 99% in the full set.

This result has one very important implication: even with
very small IR sizes, many anonymous users can be identified.
For example, with IR and AR sizes of only 30,
most users can be accurately linked (75% in Top-1 and 90%
in Top-10). This situation is very common since many
real-world users generate 30 or more reviews over multiple sites.
Therefore, even reviews from less prolific accounts can be
accurately linked.

5.3.5 Improvement II: Matching all ARs at Once 

We now experiment with another natural strategy of attempt- 
ing to match all ARs at once. 

5.3.6 Methodology 

In the previous section, we focused on linking one AR at
a time. That is, ARs were independently and incrementally
linked to IRs (accounts/reviewer-ids). One natural direction
for potential improvements is to attempt to link all ARs at the
same time. To this end, we construct algorithm Match_All()
in Figure 8 as an add-on to the KLD models suggested in
previous sections.

$SymD_{KLD}(IR_j, AR_i)$ symmetrically measures the distance
between the distributions of $IR_j$ and $AR_i$. Since
every AR maps to a distinct IR ($AR_i$ maps to $IR_i$), it
would seem that a lower $SymD_{KLD}$ would lead to a better
match. We use this intuition to design Match_All().
As shown in Figure 8, Match_All() picks the smallest
$SymD_{KLD}(IR_j, AR_i)$ as the map between $IR_j$ and
$AR_i$ and then deletes the pair $(IR_j, V_{kj})$ from all
remaining lists in $S_L$. The process continues until we compute
all matches. Note that, for any $List_{AR_k}$, $(IR_j, V_{kj})$
is deleted from the list only when there is another pair
$(IR_j, V_{ij})$ in $List_{AR_i}$ such that
$SymD_{KLD}(IR_j, AR_i) < SymD_{KLD}(IR_j, AR_k)$, and $IR_j$ has
been selected as the match for $AR_i$. The output of the algorithm is
a match-list: $S_M = \{(IR_{i_1}, AR_{j_1}), \ldots, (IR_{i_n}, AR_{j_n})\}$.

Algorithm Match_All: Pseudo-Code

Input:
    (1) Set of ARs: $S_{AR} = \{AR_1, AR_2, \ldots, AR_n\}$
    (2) Set of reviewer-ids / identified records:
        $S_{IR} = \{IR_1, IR_2, \ldots, IR_n\}$
    (3) Set of matching lists for each AR:
        $S_L = \{List_{AR_1}, \ldots, List_{AR_n}\}$

Output:
    Matching list: $S_M = \{(IR_{i_1}, AR_{j_1}), \ldots, (IR_{i_n}, AR_{j_n})\}$

    set $S_M = \emptyset$
    While $|S_{AR}| \neq 0$:
        Find $AR_i$ with smallest $SymD_{KLD}$ in all lists in $S_L$
        Get corresponding reviewer-id $IR_j$
        Add $(IR_j, AR_i)$ to $S_M$
        Delete $AR_i$ from $S_{AR}$
        Delete $List_{AR_i}$ from $S_L$
        For each $List_t$ in $S_L$:
            Delete tuple containing $IR_j$ from $List_t$
        End For
    End While

NOTE 1: $List_{AR_i}$ in $S_L$ is a list of pairs $(IR_j, V_{ij})$ where
$V_{ij} = SymD_{KLD}(IR_j, AR_i)$, for all $j$.

NOTE 2: $List_{AR_i}$ is sorted in increasing order of $V_{ij}$, i.e., $IR_j$
with the lowest $SymD_{KLD}(IR_j, AR_i)$ is at the top.

Figure 8. Pseudo-Code for matching all ARs at once.


We now consider how Match_All() could improve the
LR. Suppose that we have two ARs, $AR_i$ and $AR_j$, along
with corresponding sorted lists $L_i$ and $L_j$, and assume that
$IR_i$ is at the top of each list. Using only KLD, we would
return $IR_i$ for both ARs and thus miss one of the two.
Match_All(), in contrast, would assign $IR_i$ to only one AR:
the one with the smaller $SymD_{KLD}(IR_i, \cdot)$ value. We
would intuitively suspect that $SymD_{KLD}(IR_i, AR_i) <
SymD_{KLD}(IR_i, AR_j)$ since $IR_i$ is the right match for
$AR_i$ and thus their distributions would probably be very
close. If this is the case, Match_All() would delete $IR_i$
(erroneous match) from the top of $L_j$, which could help clear
the way for $IR_j$ (correct match) to reach the top of $L_j$.

We note that there is no guarantee that Match_All() will
always work: one mistake in early rounds would lead to
others in later rounds. We believe that Match_All() works
better if SymD_KLD(IR_i, AR_i) < SymD_KLD(IR_j, AR_i)
(j != i) holds most of the time.

In the next section, we show the results of Match_All()
when we experiment with the KLD model with digram,
rating and category tokens. (We also tried Match_All() with
the NB model and it did not improve the LR.)

5.3.7 Results

Figures 9(a) and 9(b) show the effect of Match_All() on
Top-1 LRs in the restricted identified set and the full
identified set, respectively. The combination of digram,
rating and category tokens is used. Each figure shows two
Top-1 plots: one for the LR after using Match_All() and the
other for the LR before using it. Clearly, Match_All() is
effective in improving the LR for almost all record sizes. For
the restricted set, the gain in the LR ranges from 1.6% to
16.4% for nearly all AR sizes. A similar increase, ranging
from 1% to 23.4%, is observed in the full set for most record
sizes. This shows that Match_All() is very effective when used
with digram, rating and category tokens. The privacy
implication of Match_All() is important, as it significantly
increases the LR for small ARs in the restricted set. This
shows that the privacy of less prolific users is exposed even
more with Match_All().

[Figure 9. Effects of Match_All() on LRs in full and restricted
identified set: before and after plots. Panels: (a) Match_All -
Restricted Identified Set; (b) Match_All - Full Identified Set.
Horizontal axis: Anonymous Record Size.]

5.4 Study Summary 

We now summarize the main findings and conclusions of our 
study. 

1. The LR becomes very high, reaching up to ~99.5% in
both KLD and NB, when using only digram tokens. (See
Section 5.2.2.)

2. Surprisingly, using only unigrams, we can link up to
83% of ARs in both NB and KLD models, with 96% in Top-10.
(See Section 5.2.2.) This suggests that reviewers expose
a great deal merely through their single-letter distributions.

3. Even with small record sizes, we accurately link a
significant ratio of ARs. Specifically, for AR sizes of 5 and 10
(using NB with digrams), we can accurately link 63%
and 88% of ARs, respectively. (See Section 5.2.2.)

4. Rating and category tokens are more useful when combined:
88%/69% of ARs (size 60) fall into Top-50 in
NB/KLD. (See Section 5.3.2.)

5. Non-lexical tokens are very useful in tandem with lexical
tokens, especially unigrams: we observe a ~19%/12%
Top-1 LR increase in NB/KLD for some cases. (See
Section 5.3.3.)

6. Relying only on unigram, rating and category tokens, we
can accurately link 96%/92% of the ARs (size 60) in
NB/KLD. (See Section 5.3.3.)

7. Restricting the IR size does not always degrade
linkability. In KLD, we can link as many as 97% of ARs when
the IR size is small. (See Section 5.3.4.)

8. Linking all ARs at once (instead of each independently)
helps improve accuracy. The gain is up to 16%/23% in the
restricted/full set. (See Section 5.3.7.)

9. Generally, NB performs better than KLD when we use
the full identified set, and KLD performs better when we
use the restricted identified set.

Discussion

Implications: We believe that the results of, and techniques
used in, this study have several implications. First, we
demonstrated the practicality of cross-referencing accounts
(and reviews) among multiple review sites. If a person
contributes to two sites under two identities, it is highly
likely that sets of reviews from these sites can be linked.
This could be quite detrimental to contributors' privacy.

The second implication is the ability to correlate - on the
same review site - multiple accounts that are in fact operated
by the same person. This could make our techniques
very useful in detecting review spam [11], whereby a
contributor authors reviews under different accounts to tout
(also self-promote) or criticize a product or a service.

Prolific Users: While there are clearly many more occasional
(non-prolific) reviewers than prolific ones, we believe that
our study of prolific reviewers is important, for two reasons.
First, the number of prolific contributors is still quite
large. For example, from only one review site - Yelp - we
identified ~2,000 such reviewers. Second, given the spike in
popularity of review sites [2], we believe that, in the near
future, the number of such prolific contributors will grow
substantially. Also, many occasional reviewers will, with the
passage of time, enter the ranks of "prolific" ones by slowly
accumulating a sufficient corpus of reviews over the years.
Nevertheless, our study suggests that privacy is not high even
for non-prolific users, as discussed in Section 5.3.5. For
example, when both IR and AR sizes are only 20 (i.e., total
per-user contribution is 40 reviews), we can accurately link
half of the anonymous records to their reviewers.

Anonymous Record Size: Our models perform best 
when the AR size is 60. However, for every reviewer in 
our dataset, 60 represents less than 20% of that person's 
total number of reviews. Also, using NB coupled with di- 
gram, rating and category tokens, we can accurately link 
most anonymous records when the AR size is only 10. In- 
terestingly, the AR size of 10 represents only 3% of the 
minimum user contribution. 

Unigram Tokens: While our best-performing models are
based on digram tokens, we also obtain high linkability
results from unigram tokens, reaching up to 83% (96% in the
Top-10) in NB or KLD. The results improve to 96%/92% when
we combine unigrams with rating and category tokens. Note
that the number of tokens in unigram-based models is 59
(26) with (without) rating and category tokens, whereas the
number of tokens in digram-based models is 676 (709 when
combined with rating and category tokens). This makes
linkability accuracy based on unigram models very comparable
to its digram counterpart, while the number of tokens is
significantly smaller. This implies a substantial reduction in
resources and processing power in unigram-based models, which
would make them scale better. For example, if we assume that
the attacker wants to link a set of anonymous reviews to many
large review datasets, unigram-based models would scale better,
while maintaining a comparable level of accuracy.
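
To make the token counts concrete, the following Python sketch
enumerates the 26 letter unigrams and 676 letter digrams and
builds a relative-frequency distribution from review text. It is
only an illustration: the tokenization rule (lower-casing,
ignoring non-letters, and taking digrams within words only) and
the helper names letter_counts and distribution are assumptions,
not the tokenizer used in the study. Note that 59 - 26 and
709 - 676 both equal 33, the number of rating and category
tokens implied by the figures above.

```python
import re
from collections import Counter
from itertools import product
from string import ascii_lowercase
from typing import Dict, List

# Lexical vocabularies: 26 single letters and 26 * 26 = 676 letter pairs.
UNIGRAMS: List[str] = list(ascii_lowercase)
DIGRAMS: List[str] = ["".join(p) for p in product(ascii_lowercase, repeat=2)]


def letter_counts(text: str, n: int) -> Counter:
    """Count letter n-grams inside words, ignoring case and non-letters."""
    counts: Counter = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def distribution(text: str, vocab: List[str], n: int) -> Dict[str, float]:
    """Relative frequency of every vocabulary token in the text."""
    counts = letter_counts(text, n)
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return {tok: counts[tok] / total for tok in vocab}


if __name__ == "__main__":
    print(len(UNIGRAMS), len(DIGRAMS))
    print(letter_counts("Good food!", 2))
```

A unigram distribution therefore has 26 entries per record versus
676 for digrams, which is the source of the resource savings
discussed above.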

Potential Countermeasures: One concrete application
of our techniques is integration with the review site's
front-end software in order to provide feedback to authors
indicating the degree of linkability of their reviews. For
example, when the reviewer logs in, a nominal/categorical
linkability value (e.g., high, medium, or low) could
be shown, indicating how linkable his/her last k reviews
(where k is small, e.g., 5 or 10) are to the rest. It would
then be up to the individual to maintain or modify their
reviewing patterns to be less linkable. Another way of
countering linkability is for the front-end software to
automatically suggest a different choice of words that are less
revealing (less personal) and more common among many users. We
suspect that, with the use of such words, reviews would be less
linkable and lexical distributions for different users would be
more similar.
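
The feedback idea above could be sketched as follows. This is a
hypothetical illustration only: the symmetrization used in
sym_kld (the sum of the two directed KL divergences, with
additive smoothing), the function linkability_level, and the rank
thresholds (Top-1 for "high", Top-10 for "medium") are all
assumptions chosen for the example and are not specified by the
study.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # token -> relative frequency


def sym_kld(p: Dist, q: Dist, eps: float = 1e-6) -> float:
    """Symmetric KL divergence D(p||q) + D(q||p) with additive smoothing.

    Assumption: this particular symmetrization and the smoothing constant
    are illustrative choices for the sketch.
    """
    tokens = set(p) | set(q)

    def smooth(d: Dist) -> Dist:
        total = sum(d.get(t, 0.0) + eps for t in tokens)
        return {t: (d.get(t, 0.0) + eps) / total for t in tokens}

    ps, qs = smooth(p), smooth(q)
    # D(p||q) + D(q||p) simplifies to sum over t of (p_t - q_t) * log(p_t / q_t).
    return sum((ps[t] - qs[t]) * math.log(ps[t] / qs[t]) for t in tokens)


def linkability_level(recent: Dist, own_rest: Dist, others: List[Dist],
                      high_rank: int = 1, medium_rank: int = 10) -> str:
    """Map the rank of the true author to a high/medium/low label.

    recent:   distribution of the user's last k reviews
    own_rest: distribution of the same user's remaining reviews
    others:   distributions of other users' reviews
    The rank thresholds are hypothetical defaults.
    """
    own = sym_kld(recent, own_rest)
    # Rank 1 means no other user is closer to the recent reviews.
    rank = 1 + sum(1 for o in others if sym_kld(recent, o) < own)
    if rank <= high_rank:
        return "high"
    if rank <= medium_rank:
        return "medium"
    return "low"


if __name__ == "__main__":
    recent = {"a": 0.5, "b": 0.5}
    print(linkability_level(recent, {"a": 0.5, "b": 0.5},
                            [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}]))
```

A site could compute the label at login from precomputed
distributions, so that only the user's last k reviews need to be
re-tokenized.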



6. Future Work 

Although our results point at high linkability of reviews, 
there remain many open questions. First, the anonymous 
records are not highly linkable when their sizes are very 
small, e.g., 1 or 5. As part of future work, we plan to improve 
linkability on very small anonymous records. In addition, al- 
though we take advantage of ratings and categories to boost 
LRs, we need to further explore usage of other non-textual 
features, such as sub-categories of places, products and
services reviewed, as well as the length of reviews. In fact,
it would be interesting to see how the LR can be improved
without resorting to lexical features, since they generally
entail heavy processing. We also plan to implement the
countermeasure techniques described in Section 5.4 and examine
their efficacy.

Moreover, we plan to investigate LRs in other preference 
databases, such as music/song ratings, and check whether 
contributors inadvertently link their reviews through prefer- 
ences. It would be interesting to see how to leverage tech- 
niques used in recommender systems (for future rating pre- 
diction) to increase LRs. 

In Section 5.3.5, we showed how to improve LRs by
linking all anonymous records at once. We intend to further
investigate the effect on the LR of the number of anonymous
records when each record belongs to a different reviewer.



7. Related Work 

Many authorship analysis studies have appeared in the
literature. Among the most prominent recent studies are
[4, 10, 24]. The study in [10] proposed techniques that extract
frequent-pattern write-prints that characterize one (or a group
of) authors. The best achieved accuracy was 88% when
identifying an author, from a single anonymous message, from
a small set of four and with a training set size of forty
messages per author. In [24], a framework for author
identification for on-line messages was introduced where four
types of features are extracted: lexical, syntactic, structural
and content-specific. Three types of classifiers were used for
author identification: Decision Trees, Back Propagation Neural
Networks and Support Vector Machines. The last one
outperformed the others, achieving 97% accuracy on a set of
fewer than 20 authors. The work in [4] also considered author
identification and similarity detection by incorporating a rich
set of stylistic features along with a novel technique (based
on Karhunen-Loeve transforms) to extract write-prints. These
techniques performed well, reaching as high as 91% in
identifying the author of anonymous text from a set of 100. The
same approach was tested on a large set of buyer/seller
feedback comments collected from eBay. Such comments
typically reflect one's experience when dealing with a buyer or
a seller. Unlike our general-purpose reviews, these comments
do not review products, services or places of different
categories. Additionally, the scale of the problem was
different: the analysis was performed for 100 authors, whereas
our analysis involved ~2,000 reviewers.



A problem very similar to ours was explored in [17].
It focused on identifying authors based on reviews in both
single- and double-blinded peer-reviewing processes. A Naive
Bayes classifier was used - along with unigrams, bigrams
and trigrams - to identify authors, and the best result was
around 90%. In [8], citations of a given paper were used to
identify its authors. The data set was a very large archive
of physics research papers (KDDCUP 2003 physics-paper
archive). Authors were identified 40-45% of the time. In
[21], authorship analysis was performed on a set of candidate
authors who wrote on the same topics. Specifically, analysis
was done on movie reviews by five reviewers on the same
five movies. Although reviews similar to ours were used,
there were significant differences. We use over 1,000,000
reviews by ~2,000 authors, whereas only 25 reviews by
5 authors were used in [21]. A related result [13] studied the
problem of inferring the gender of a movie reviewer from
his/her review. Using logistic regression [15] along with
features derived from the writing style, content, and
meta-data of the review, accuracy of up to 73.7% was achieved
in determining the correct gender. The goal of this study
was clearly quite different from ours. For a comprehensive
overview of authorship analysis studies, we refer to [22].

While all aforementioned results are somewhat similar to 
our present work, there are some notable differences. First, 
we perform authorship identification analysis in a context
that has not been extensively explored - user reviews. User 
reviews are generally written differently from other types of 
writing, such as email and research papers. In a review, the 
author generally assesses something and thus the text con- 
veys some evaluation and personal opinions. A review usu- 
ally conveys information about personal taste, since most 
people tend to review things of interest to them. In addi- 
tion, reviews contain other non-textual information, such as 
the ratings and categories of things being reviewed. These 
types of extra information provide added leverage; recall 
that, as discussed earlier, ratings and categories are partic- 
ularly helpful in increasing the overall linkability ratio. Sec- 
ond, our problem formulation is different. We study linkabil- 
ity of reviews (and user-ids of their authors) in the presence 
of a large number of prolific contributors where the num- 
ber of anonymous reviews could be more than one (up to 
60 reviews). In contrast, most prior work attempts to identify
authors from a small set of authors, each with a small set of
texts, where the number of anonymous documents/messages
is one.

Some work has been done in recovering authors based
on their ratings, using external knowledge. In particular,
one study examined author linkability across two different
databases: one public and the other private. Several techniques
were used to link authors in public forums (public), who state
their opinions and ratings about movies, to reviewers who
contribute to a sparse database (private) of movie ratings. A
related result [18] considered anonymity in high-dimensional and
sparse data sets of anonymized users. First, it presented a
general definition and model of privacy breaches in such
sets. Second, a statistical de-anonymization attack was
presented that was resilient to perturbation. Third, this attack
was used to de-anonymize the Netflix [1] data set. Note that
the problem formulation of these two results differs from
ours. They studied anonymity in the presence of an external
source of public information. In contrast, our work does
not rely on any external sources.

Last but not least, another related research effort assessed
the authenticity of reviews [11]. It explored the problem of
identifying spam. Results demonstrated that spam reviews were
prevalent, and a countermeasure based on logistic regression
was proposed.

8. Conclusion 

Large numbers of Internet users are becoming frequent
visitors and contributors to various review sites. At the same
time, they are concerned about their privacy. In this paper,
we study linkability of reviews. Based on a large set of
reviews, we show that a high percentage (99% in some cases)
are linkable, even though we use very simple models and
a very simple feature set. Our study suggests that users
reliably expose their identities in reviews. This has important
implications for cross-referencing accounts among
different review sites and detecting people who write reviews
under different identities. Additionally, techniques used in
this study could be adopted by review sites to give
contributors feedback about linkability of their reviews.